Computer Science > Computation and Language

arXiv:2402.02750 (cs)
[Submitted on 5 Feb 2024 (v1), last revised 25 Jul 2024 (this version, v2)]

Title: KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Authors:Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
Abstract: Efficiently serving large language models (LLMs) requires batching many requests together to reduce the cost per request. Yet, with larger batch sizes and longer context lengths, the key-value (KV) cache, which stores attention keys and values to avoid re-computation, significantly increases memory demands and becomes the new bottleneck in speed and memory usage. Additionally, loading the KV cache leaves the computational cores idle, which limits inference speed. A straightforward and effective way to reduce KV cache size is quantization, which decreases the total bytes the KV cache occupies. However, in-depth studies of the element distribution of the KV cache, needed to understand the difficulty and limitations of KV cache quantization, have been lacking. To fill this gap, we conducted a comprehensive study of the element distribution in the KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., elements should be grouped along the channel dimension and quantized together, whereas the value cache should be quantized per-token. Based on this analysis, we developed KIVI, a tuning-free 2-bit KV cache quantization algorithm. With a hardware-friendly implementation, KIVI enables Llama, Falcon, and Mistral models to maintain almost the same quality while using $\mathbf{2.6\times}$ less peak memory (including model weights). This reduction in memory usage enables up to a $\mathbf{4\times}$ larger batch size, bringing $\mathbf{2.35\times \sim 3.47\times}$ higher throughput on real LLM inference workloads. The source code is available at this https URL.
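The abstract's core finding, per-channel quantization for the key cache versus per-token quantization for the value cache, can be illustrated with a short sketch. The snippet below is not the authors' released implementation: it is a minimal PyTorch illustration under assumed shapes ([num_tokens, head_dim] per attention head) and a plain asymmetric min-max quantizer, and it omits details such as group sizes, bit packing, and the handling of newly appended tokens.

```python
# Minimal sketch (not the KIVI release) of asymmetric 2-bit KV cache
# quantization: keys quantized per-channel, values per-token.
# Shapes and function names are illustrative assumptions.
import torch

def quantize_2bit(x: torch.Tensor, dim: int):
    """Asymmetric 2-bit min-max quantization, one scale/zero per group.

    `dim` is the dimension reduced over when computing min/max, so the
    surviving dimension defines the grouping axis. Returns integer codes
    in [0, 3] plus the scale and zero-point needed to dequantize.
    """
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / 3.0  # 2 bits -> 4 levels
    codes = ((x - xmin) / scale).round().clamp(0, 3).to(torch.uint8)
    return codes, scale, xmin

def dequantize(codes, scale, zero):
    return codes.to(scale.dtype) * scale + zero

# Toy KV cache for one head: [num_tokens, head_dim]
K = torch.randn(128, 64)
V = torch.randn(128, 64)

# Keys: reduce over tokens (dim=0) -> one scale per channel.
# Values: reduce over channels (dim=1) -> one scale per token.
k_codes, k_scale, k_zero = quantize_2bit(K, dim=0)  # per-channel
v_codes, v_scale, v_zero = quantize_2bit(V, dim=1)  # per-token

K_hat = dequantize(k_codes, k_scale, k_zero)
V_hat = dequantize(v_codes, v_scale, v_zero)
print("key reconstruction error:", (K - K_hat).abs().mean().item())
print("value reconstruction error:", (V - V_hat).abs().mean().item())
```

On random Gaussian data both errors come out similar; the paper's point is about which grouping axis tolerates 2-bit precision given the outlier structure of real key caches, which synthetic data does not reproduce.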
Comments: ICML 2024
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)
Cite as: arXiv:2402.02750 [cs.CL]
  (or arXiv:2402.02750v2 [cs.CL] for this version)
  https://doi.org/10.48550/arXiv.2402.02750
arXiv-issued DOI via DataCite
Related DOI: https://doi.org/10.13140/RG.2.2.28167.37282
DOI(s) linking to related resources

Submission history

From: Jiayi Yuan
[v1] Mon, 5 Feb 2024 06:06:47 UTC (3,109 KB)
[v2] Thu, 25 Jul 2024 09:16:05 UTC (3,132 KB)